Abstract

Background: Bacterial 16S ribosomal RNA (rRNA) profiling has been widely used in the classification of
microbiota-associated diseases. Dimensionality reduction is a key step in mining high-dimensional 16S rRNA
expression data. High levels of sparsity and redundancy are common in 16S rRNA gene microbial surveys.
Traditional feature selection methods are generally restricted to measuring correlated abundances, and are
limited in discrimination when so few microbes are actually shared across communities.

Results: Here we present a Feature Merging and Selection algorithm (FMS) for 16S rRNA expression data. By
integrating the Linear Discriminant Analysis method, FMS reduces the feature dimension with higher accuracy
while preserving the relationships between features. Two 16S rRNA expression datasets, from pneumonia and
dental decay patients, were used to test the validity of the algorithm. Combined with SVM, FMS discriminated
the classes of both pneumonia and dental caries better than other popular feature selection methods.

Conclusions: FMS projects data into a lower dimension while preserving enough features, and thus improves the
intelligibility of the result. The results showed that FMS is a more valid and reliable method for feature reduction.



Background 

The biogeography of microbiota in the human body is linked intimately with aspects of host
metabolism, physiology and susceptibility to disease [1,2]. Previous studies have found that
dysbiosis of the microbial distribution, or infection by pathogenic microbiota, can lead to
human diseases such as pneumonia [3], dental caries [4], cutaneous disease [5], or other
diseases [6,7]. Characterization of the abundant and rare microbiota represents essential
groundwork for human health [3,8]. Knowledge of the human microbiome has been expanded greatly
by techniques such as 16S rRNA gene sequencing and metagenomics. Gene expression sequencing
enables the simultaneous measurement of the expression levels of thousands of genes. As with
gene selection, the curse of dimensionality also applies to the problem of microbiota
classification [9,10].

The ability to successfully distinguish between disease classes using gene expression data is
an important aspect of disease classification. Discrimination methods include nearest-neighbor,
linear discriminant analysis, and classification trees [11]. The nature of gene expression data
and its acquisition means that it is subject to the curse of dimensionality, the situation
where there are vastly more measurable features (genes) than there are samples. Dimension
reduction methods are widely used for classification or for obtaining low-dimensional
representations of datasets. Traditionally, two types of methods are used to reduce
dimensionality: feature selection and feature transformation [12]. Feature selection techniques
do not alter the original representation of the features, but merely select a subset of
features from the large set of profiles. Three kinds of feature selection are widely used:
filter methods, wrapper methods and embedded methods [13]. However, most existing feature
selection methods reduce a high-dimensional feature space to a manageable one at the cost of
losing the relationships between features.

© 2012 Wang et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.

Wang et al. BMC Systems Biology 2012, 6(Suppl 3):S12
http://www.biomedcentral.com/1752-0509/6/S3/S12

Contrasted with feature selection, feature transforma- 
tion methods create a new feature space with an optimal 
subset of predictive features measured in the original 
data. Some traditional feature transformation methods, 
such as principal component analysis (PCA) and linear 
discriminant analysis (LDA), output a combination of 
original features. PCA converts a set of possibly corre- 
lated variables into a set of orthogonal factors that effi- 
ciently explain the variance of the observations. LDA 
transforms original features to k-1 dimensions if there 
are k categories of training data. These traditional methods
are fast and easy to compute, but they have some weaknesses [14]:
not all of the discrimination vectors obtained are useful in
pattern classification, and features of different dimensions
overlap, so the results are often difficult to interpret.

Previous surveys showed that taxon relative abundance 
vectors from 16S rRNA genes expression provide a 
baseline to study the role of bacterial communities in 
disease states [15-17]. However, high levels of sparsity 
are common in 16S rRNA gene microbial surveys, pre- 
senting the fundamental challenge for their successful 
analysis. Identifying which microbes will produce good 
discrimination remains challenging when so few 
microbes are actually shared across communities. 
In addition, in a typical study of microbiota, the 16S rRNA expression levels of different
samples may be redundant. Traditional feature selection methods are generally restricted to
measuring correlated abundances, and lose information when redundant features are removed. In
microbiota analysis, it is critical to preserve enough features to keep the result intelligible
while simultaneously minimizing the classification error rate and effectively reducing the
feature dimension. To solve these problems, we introduced an improved Feature Merging and
Selection algorithm (FMS for short) to identify combinations of 16S rRNA genes that give the
best discrimination of sample groups. FMS extracts essential features from the high-dimensional
feature space; an efficient classifier with a lower classification error rate is then employed.
The data are thus projected into a lower dimension while enough features are preserved, which
improves the intelligibility of the result. Performance was tested on 16S rRNA expression
datasets from pneumonia patients and from dental caries patients.



Results 

Feature Merging and Selection algorithm 

Two statistical measures were considered to handle the continuous and sparse 16S rRNA
expression levels. The Fisher statistic was used to test the classification ability of
features, and the Pearson correlation coefficient was used to describe the redundancy between
features. We developed a new method called the Feature Merging and Selection algorithm, which
incorporates the Linear Discriminant Analysis (LDA) method to learn linear relationships
between features. Classical LDA requires the total scatter matrix to be nonsingular. However,
in gene expression data analysis, all scatter matrices in question can be singular, since the
data points come from a very high-dimensional space and the sample size generally does not
exceed this dimension. To deal with the singularity problem, the classical LDA method was
modified so that an identity matrix scaled by a small weight was added to the within-class
scatter matrix. This addition was repeated until the matrix became nonsingular.
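The regularization described above can be illustrated for a single pair of features, where the within-class scatter matrix is only 2x2 and its determinant can be checked directly. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the step size `step` and the tolerance `tol` are hypothetical, and the paper does not state the actual weight used.

```python
def within_class_scatter(samples, labels):
    """2x2 within-class scatter matrix for samples given as (x1, x2) pairs."""
    scatter = [[0.0, 0.0], [0.0, 0.0]]
    for cls in sorted(set(labels)):
        points = [p for p, y in zip(samples, labels) if y == cls]
        mean = [sum(p[d] for p in points) / len(points) for d in range(2)]
        for p in points:
            diff = [p[0] - mean[0], p[1] - mean[1]]
            for a in range(2):
                for b in range(2):
                    scatter[a][b] += diff[a] * diff[b]
    return scatter


def regularize(scatter, step=1e-6, tol=1e-12):
    """Add step * I repeatedly until the determinant is no longer negligible.

    `step` and `tol` are illustrative values, not taken from the paper.
    Returns the regularized matrix and the total weight that was added.
    """
    s = [row[:] for row in scatter]
    added = 0.0
    while abs(s[0][0] * s[1][1] - s[0][1] * s[1][0]) <= tol:
        s[0][0] += step
        s[1][1] += step
        added += step
    return s, added


# Two perfectly collinear features give a singular scatter matrix.
samples = [(1, 2), (2, 4), (3, 6), (4, 8)]
labels = [0, 0, 1, 1]
s_w = within_class_scatter(samples, labels)
s_w_reg, added = regularize(s_w)
```

For larger feature sets the same idea applies with a general singularity check in place of the 2x2 determinant.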

FMS algorithm consists of two parts: feature merging and 
feature deletion. Feature merging is the main part of the 
algorithm. The procedure is described below (see Figure 1): 

Step 1: Initialization: set the weights of all the features to 1 and the counter to 0; label
the features from 1 to n, where n is the total number of features.
Step 2: Loop over steps 3 to 7 until the counter equals n-1.

Step 3: Delete features of zero variance, and add the number of deleted features to the counter.
Step 4: Compute the pairwise relationships of the remaining features using the modified LDA,
and preserve the combined features with the maximal Fisher statistic. The Fisher statistic is
defined as

$$F = \frac{\sum_{k=1}^{K} n_k (m_k - m)^2 / (K-1)}{\sum_{k=1}^{K} (n_k - 1)\,\sigma_k^2 / (n-K)},$$

where K is the total number of classes, n is the size of all the samples, $n_k$ is the size of
the kth class, $m_k$ is the mean value of the samples within the kth class, m is the mean value
of all the samples, and $\sigma_k^2$ is the variance of the kth class.

Step 5: Measure the combination ability by combining the Fisher statistic and the Pearson
correlation coefficient: merging value = (new value of the Fisher statistic) × (Pearson
correlation coefficient) / (geometric mean of the Fisher statistics of the original features).

Step 6: Select and merge the feature pair with the greatest merging value, save the original
labels, and multiply the weights by the previously trained weights.
Figure 1 FMS algorithm flowchart. Begin; initialize weights and counter; enter the loop (delete
the zero-variance features and accumulate counts; compute pairwise relationships of the
remaining features; measure the combination ability; merge the feature pair with the greatest
combination ability; normalize weights and increment the counter); exit the loop; re-compute
the weights of each combination; end.



Step 7: Normalize the weight; add 1 to the counter. 
Step 8: Re-compute the weight of each combina- 
tion using LDA until the original feature number 
is less than two. Preserve the combination with 
maximal Fisher statistics value and normalize the 
weights. 
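The two quantities that drive the merging loop, the Fisher statistic of Step 4 and the merging value of Step 5, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the Pearson coefficient is used with its sign exactly as the step describes (the paper does not say whether an absolute value is taken).

```python
import math


def fisher_statistic(values, labels):
    """Between-class over within-class variance ratio of one feature."""
    classes = sorted(set(labels))
    n, k = len(values), len(classes)
    grand_mean = sum(values) / n
    between = within = 0.0
    for cls in classes:
        group = [v for v, y in zip(values, labels) if y == cls]
        class_mean = sum(group) / len(group)
        between += len(group) * (class_mean - grand_mean) ** 2
        # Sum of squared deviations equals (n_k - 1) * sigma_k^2.
        within += sum((v - class_mean) ** 2 for v in group)
    # Note: raises ZeroDivisionError if every class has zero variance.
    return (between / (k - 1)) / (within / (n - k))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def merging_value(f_merged, f_a, f_b, correlation):
    """Step 5: merged Fisher statistic times correlation over the geometric mean."""
    return f_merged * correlation / math.sqrt(f_a * f_b)


f = fisher_statistic([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])
r = pearson([1, 2, 3], [2, 4, 6])
mv = merging_value(8.0, 4.0, 4.0, 0.5)
```

In the algorithm, `f_merged` would be the Fisher statistic of the LDA combination of two features, and `f_a`, `f_b` those of the two original features.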



After feature merging, the resulting combinations reveal the relationships between the original
features. As more features are deleted, the linear bias grows but the variance falls, and vice
versa. To compromise between bias and variance, we selected the dimension reduction ratio by
5-fold proportional cross validation [12,18]. The data were partitioned into training data and
test data. The training data were used for feature merging, to learn the relationships between
features and to output combinations of features. The test data were used to estimate the error
rate of feature merging. If two or more feature merging results had equal error rates, the one
with the largest merging degree was kept, giving lower-dimensional feature vectors.
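A fold assignment for this kind of cross validation can be sketched as below. The sketch assumes that "proportional" means stratified, i.e. each fold keeps roughly the class proportions of the whole training set; the paper does not spell this out, and the function name is hypothetical.

```python
def proportional_folds(labels, n_folds=5):
    """Split sample indices into folds that keep the class proportions.

    Assumption: 'proportional' cross validation is read as stratified
    cross validation; samples of each class are dealt round-robin.
    """
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for members in by_class.values():
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds


labels = ["HAP"] * 10 + ["CAP"] * 5
folds = proportional_folds(labels)
```

Each fold in turn serves as the validation part while the other four are used for feature merging.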

To simplify the model, features were deleted based on the combinations that resulted from
feature merging and cross validation. The Fisher statistic values were multiplied by the weight
of each combination. Features were sorted in ascending order by the absolute value of their
weights and deleted one by one, and the error rate was obtained by 5-fold proportional cross
validation. Among classification results with equal error rates, the combination with the lower
dimension or the smaller number of features was preserved. Unimportant features were thus
deleted to simplify the model. In summary, FMS determines the final dimensionality, and thus
the optimal number of features, as the one that yields the lowest cross-validation error rate.
The FMS algorithm is a dimensionality reduction method and should be used in combination with a
classifier.

The Fisher method has high classification ability on datasets with low noise, but its
performance degrades on noisy data. To address this weakness, the mutual information method was
used for feature deletion instead of the Fisher statistic when the data were noisy. Following
Occam's razor [19], we considered the classification combination with the lowest dimension to
be the simplest result. We calculated the error rate plus a penalty for each dimension as a
criterion for feature selection [13], and selected the first m results with the lowest value,
where m is log(N) and N is the original dimension. The weight of the penalty was set to the
range of the first m error rates divided by the range of the corresponding dimensions. If the
first t feature merging results had the same value of error rate plus penalty, m was set to
log(N)+t-1. This method provided an alternative way to deal with noisy data.

Examples of FMS algorithm 

We first tested the FMS algorithm on 16S rRNA expression profiles obtained from pneumonia
samples belonging to three classes: 101 patients with hospital-acquired pneumonia (HAP), 43
patients with community-acquired pneumonia (CAP), and 42 normal persons as controls [3]. We
assigned the 16S rRNA expression profiles to microbial taxa, as 16S rRNA sequences are often
conserved within a species and generally differ between species. The expression data matrix was
then expressed as percentage values of microbiota. Features with zero variance were deleted.
The data were partitioned into two parts for training and testing the model. The training data
included profiles from 71 cases of HAP, 32 cases of CAP and 30 normal samples, and the test
data included profiles from 30 cases of HAP, 13 cases of CAP and 12 normal samples. The
training data were used for cross validation, and the test data were used to check the error
rate. Five-fold proportional cross validation was performed on the training data to determine
the degree of feature merging and feature deletion.

The feature merging algorithm was then run on the whole training data using the degrees of
feature merging and feature deletion obtained above; its output reflected the relationships
between combinations of features. The classifier was then applied to the test data, and the
error rate was obtained. The k-nearest neighbor algorithm (kNN) and the support vector machine
(SVM) are widely used tools for classification. SVM was selected as the classifier to accompany
the algorithm because of its lower error rate on the pneumonia training data. Four widely used
feature selection methods, the mRMR method [20], the Information Gain method [21], the χ²
statistic [22] and the Kruskal-Wallis test [23], were used as controls to test the validity of
the FMS method.

Two types of classification were considered: a three-class problem and a two-class problem. The
former outputs three classes, i.e. HAP, CAP and normal; the latter outputs two classes, i.e.
pneumonia (HAP and CAP) and normal. As SVMs are inherently two-class classifiers, the
one-against-all decomposition technique was used to divide the three-class problem into two
binary problems. Normal samples were discriminated from pneumonia samples in the first step,
and HAP and CAP were then discriminated. For the two-class problem, the training data were
imbalanced because there were fewer normal samples than pneumonia samples. Pneumonia samples
were therefore clustered into three subgroups [24], and each pneumonia subgroup was mixed with
the normal samples to form a training dataset. A model was trained on each mixed dataset, and
each model's classification of the test data gave one vote to a class.

For balanced training data, the error rate obtained over the whole training data is a suitable
measure of classification ability. However, it is not suitable for imbalanced data. Therefore,
the mean error rate [25] over the classes was used to measure classification performance. The
error rate of the ith category was calculated as

$$\frac{F_i}{T_i + F_i},$$

where $T_i$ is the percentage of samples of the ith category with the correct label and $F_i$
is the percentage of samples of the ith category with the wrong label [26]. The learning curves
showed that the lowest error rate was achieved with 108 feature merging operations and 8
deleted features in the 3-class problem (Figure 2a, b), and with 95 feature merging operations
and 13 deleted features in the 2-class problem (Figure 2c, d).
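The per-class error rate defined above, averaged over the classes, can be sketched as follows. The function name and the toy labels are hypothetical; the example only illustrates why the class-wise mean differs from the overall error rate on imbalanced data.

```python
def mean_class_error(true_labels, predicted_labels):
    """Average of the per-class error rates F_i / (T_i + F_i)."""
    rates = []
    for cls in sorted(set(true_labels)):
        total = sum(1 for t in true_labels if t == cls)
        wrong = sum(
            1 for t, p in zip(true_labels, predicted_labels) if t == cls and p != cls
        )
        rates.append(wrong / total)
    return sum(rates) / len(rates)


# Imbalanced toy data: 2 normal samples, 8 pneumonia samples.
truth = ["normal"] * 2 + ["pneumonia"] * 8
predicted = ["pneumonia", "normal"] + ["pneumonia"] * 8
balanced_error = mean_class_error(truth, predicted)
overall_error = sum(t != p for t, p in zip(truth, predicted)) / len(truth)
```

Here one of the two normal samples is misclassified, so the class-wise mean error is 0.25 while the overall error is only 0.1, which is why the overall rate understates errors on the smaller class.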

Combined with either the SVM or the kNN classifier, the FMS algorithm had the lowest mean error
rate in both the 3-class and 2-class problems compared with four other widely used feature
selection methods, i.e. the mRMR method [20], the Information Gain method [27], the χ²
statistic [22] and the Kruskal-Wallis test [23]. In both problems, the FMS algorithm reduced
the dimension of the original data to a level lower than or close to that of the other four
methods, while preserving enough features (Table 1, Table 2). A ROC curve represents the
tradeoff between sensitivity and specificity across the threshold values used to define an
abnormal test. A ROC curve was constructed for each subset of features. The ROC curves showed
that the optimal features determined by FMS, selected under the criterion of lowest
cross-validation error rate, reached high accuracy (~80%) with high sensitivity (~80%)
(Additional file 1, Figures 1 and 2), and that high specificity was obtained overall,
demonstrating the feature reduction quality of FMS. The results showed that, combined with a
classifier, the FMS algorithm output lower-dimensional combinations of features and achieved a
lower classification error rate. FMS performed better in classification when combined with the
SVM classifier than with the kNN classifier; FMS combined with SVM was therefore used to
classify the 16S rRNA expression profiles of pneumonia samples, and the classification results
were used for further analysis.

A heatmap is a frequently used display of pairwise sample correlations in which correlation or
anti-correlation is indicated by a color scale, e.g. green to red. In the heatmap of all the
original 16S rRNA expression data (Figure 3a, c), similarities and differences between samples
or genes are easily lost because of the large size of the visualization. After feature
extraction by FMS, the original space was reduced to the space spanned by a few features, with
some data loss but retaining the most important variance (Figure 3b, d). After the
dimensionality reduction, the pairwise display of samples indicates similarity in expression
profiles much more clearly and at higher resolution.

Combinations of features were sorted by their Fisher statistics, which indicate discrimination
ability. The microbiota signatures with the best discrimination ability enabled us to identify
low- and high-risk patients with distinct pneumonia classes (Additional file 1, Table 1). The
results showed that Shuttleworthia was a distinct indicator of pneumonia in the three-class
problem, and Acidaminococcus in the two-class problem. Shuttleworthia and Acidaminococcus have
previously been reported as causes of pneumonia [28,29]. Of the top 20 genera suspected of
contributing to hospital-associated pneumonia [3], about half were found in the resulting
combination with the best discrimination ability in the three-class problem (Additional file 1,
Table 1). FMS discriminates microbial signatures efficiently, which will enable improved
disease classification. Phylogenetic trees were constructed based on the nucleotide sequences
of the microbiota 16S rRNAs. It is noteworthy that the microbiota signatures are dispersed
across the phylogenetic tree (Figures 4 and 5), which indicates that the enormously diverse
microbiota performs important functions for the host organism. FMS provides a taxonomically
wide set of microbiota signatures for evaluating the agents' contribution to the infection.

Figure 2 Learning curves of FMS algorithm for feature merging in 3-class problem (a), feature deletion in 3-class problem (b), feature
merging in 2-class problem (c) and feature deletion in 2-class problem (d).

Table 1 Classification ability on pneumonia data in 3-class problem

Method               Error rate on training data   Error rate on test data   Dimension   Feature number   Note
svm/FMS              0.1895                        0.2637                    29          129
svm/mRMR             0.2267                        0.3103                    38          38
svm/KruskalWallis    0.1984                        0.3816                    107         107
svm/InformationGain  0.2425                        0.3684                    28          28
svm/χ² statistic     0.2127                        0.4308                    125         125
svm                  0.2841                        0.4017                    137         137
kNN/FMS              0.2013                        0.3406                    112         133              k = 1
kNN/mRMR             0.2635                        0.3774                    130         130              k = 1
kNN/KruskalWallis    0.2492                        0.3795                    134         134              k = 1
kNN/InformationGain  0.2635                        0.3774                    130         130              k = 1
kNN/χ² statistic     0.2537                        0.4128                    124         124              k = 1
kNN                  0.2635                        0.3774                    137         137              k = 1

The FMS algorithm was also tested on 16S rRNA profiles from dental decay patients. The samples
were collected from saliva and dental plaque separately. For the 16S rRNA expression levels
from dental plaque samples, the training data contained 23 dental decay patient samples and 20
normal samples, and the test data contained 9 dental decay patient samples and 8 normal
samples. For the saliva samples, the training data contained 23 dental decay patient samples
and 19 normal samples, and the test data contained 10 dental decay patient samples and 8 normal
samples. As these dental decay datasets are noisy, the mutual information method was used for
feature deletion instead of the Fisher statistic. On these noisy data, FMS also performed
better than the mRMR method [20] and the Kruskal-Wallis test [23] (Additional file 1, Tables 2
and 3).

Table 2 Classification ability on pneumonia data in 2-class problem

Method               Error rate on training data   Error rate on test data   Dimension   Feature number   Note
svm/FMS              0.0922                        0.1279                    42          123
svm/mRMR             0.1313                        0.1977                    36          36
svm/KruskalWallis    0.1081                        0.1628                    62          62
svm/InformationGain  0.1456                        0.186                     54          54
svm/χ² statistic     0.1561                        0.186                     127         127
svm                  0.1611                        0.1977                    137         137
kNN/FMS              0.1279                        0.2393                    20          130              k = 1
kNN/mRMR             0.2532                        0.3372                    54          54               k = 1
kNN/KruskalWallis    0.1861                        0.3343                    25          25               k = 4
kNN/InformationGain  0.2248                        0.3256                    107         107              k = 1
kNN/χ² statistic     0.336                         0.4535                    107         107              k = 1
kNN                  0.346                         0.4419                    137         137              k = 1

Figure 3 The expression profiles of original pneumonia data for 3-class problem (a), data after treatment by FMS for 3-class problem (b),
original pneumonia data for 2-class problem (c) and data after treatment by FMS for 2-class problem (d). Rows are microbiota and
columns are disease classes. From left to right are 30 normal, 32 CAP and 71 HAP samples for the 3-class problem, and 30 normal and 103
pneumonia samples for the 2-class problem.

Conclusions 

In this work, we introduced the FMS algorithm to address the high sparsity and redundancy of
16S rRNA gene microbial surveys, thereby identifying combinations of 16S rRNA genes that give
the best discrimination of sample groups. The FMS method has several distinct advantages that
make it useful to researchers: 1) FMS reduces the feature dimension with higher accuracy while
preserving the relationships between features, which improves the intelligibility of the
result. 2) FMS groups features into sets of combinations, which distinguish among classes more
efficiently and meaningfully than individual features do; this is in line with the observation
that particular combinations of specific bacteria are associated with individual symptoms and
signs [30]. 3) FMS classifies on combined features, which may compensate for the influence of
individual features and thus provides more robust classification with higher accuracy and less
variation. 4) Unlike LDA, FMS groups features into combinations; features of different
combinations do not overlap, and the relationships between features are well preserved.

In conclusion, we developed a new feature merging and selection algorithm for 16S rRNA
expression data that reduces feature dimensionality while retaining enough important features.
The method retains some advantages of both LDA and other feature selection methods, and
reduces dimensions much more effectively. As the classification examples showed, the FMS
algorithm reduced the dimensionality of the data effectively without losing important features,
which made the results more intelligible. FMS performed well and will be useful in human
microbiome projects for identifying biomarkers of disease or other physiological conditions.

Data and method 

Data 

We obtained the 16S rRNA expression profiles of pneumonia patients from Zhou et al. [3], and
the 16S rRNA expression profiles of dental decay patients from Ling et al. [4]. The 16S rRNA
sequences used for constructing the phylogenetic trees were downloaded from the NCBI website
(ID: GU737566 to GU737625, and HQ914698 to HQ914775) (http://www.ncbi.nlm.nih.gov). After
removing redundant sequences, a total of 90 microbe species were used for phylogenetic
analysis.

Linear Discriminant Analysis 

Linear Discriminant Analysis (LDA) is a typical variable transformation method for reducing
dimensions [31]. The key of LDA is to maximize the Rayleigh quotient:

$$J(w) = \frac{w^{T} S_b w}{w^{T} S_w w},$$

where

$$S_b = \sum_{k=1}^{K} n_k (m_k - m)(m_k - m)^{T}$$

is the "between classes scatter matrix" and

$$S_w = \sum_{k=1}^{K} \sum_{x \in \text{class } k} (x - m_k)(x - m_k)^{T}$$

is the "within classes scatter matrix". K is the number of classes, $n_k$ is the number of
samples within the kth class, $m_k$ is the mean value of the samples within the kth class, and
m is the mean value of all the samples.

The LDA method finds a direction that maximizes the separation of the projected class means
while minimizing the within-class variance in this direction. To prevent $S_w$ from being
singular, we added an identity matrix scaled by a small weight to $S_w$ in each loop until
$S_w$ became non-singular. The program can be downloaded from
http://www.mathworks.com/matlabcentral/fileexchange/29673-lda-linear-discriminant-analysis/content/LDA.m

Figure 4 Phylogenetic relationship of microbiota signatures in 3-class problem. The microbiota signatures with best discrimination ability
were labeled with green star.

Figure 5 Phylogenetic relationship of microbiota signatures in 2-class problem. The microbiota signatures with best discrimination ability
were labeled with green star.



Support vector machine algorithm 

The support vector machine (SVM) algorithm is one of the most popular supervised learning
methods, based on the concept of the maximal margin hyperplane [32]. The hyperplane separates
training samples with two different labels such that the positive and negative categories both
lie at the largest possible distance from it. A multi-class problem is transformed into binary
problems by a scheme such as one-against-one or one-against-all. A kernel approach is used to
construct a nonlinear decision boundary if the data are not linearly separable. We used the
Radial Basis Function kernel, $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / c}$, where c > 0 is a
scalar.
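As a small illustration of the kernel formula above, the following sketch evaluates the Radial Basis Function kernel for two feature vectors. The function name and the default value of `c` are hypothetical; the paper does not report the value of c that was used.

```python
import math


def rbf_kernel(x_i, x_j, c=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / c) for a positive scalar c."""
    if c <= 0:
        raise ValueError("c must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / c)


same = rbf_kernel([0.0, 0.0], [0.0, 0.0])
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], c=2.0)
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart; a larger c makes the decay slower.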

k-nearest neighbor algorithm 

The k-nearest neighbor algorithm (kNN) is a nonparametric method of supervised classification
based on a distance function $d(x_q, x_i)$, such as the Euclidean distance. The original data
were preprocessed so that the values of each feature have zero mean and unit variance [33]. The
votes of the k nearest neighbors were weighted by distance to refine the model. The improved
kNN algorithm is:

$$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i)),$$

where V is the set of class labels, $w_i = 1 / d(x_q, x_i)$, $f(x_i)$ is the label of the ith
sample, and $\delta(a,b) = 1$ when a = b, otherwise $\delta(a,b) = 0$. $F(x_q)$ was assigned to
$f(x_i)$ when the distance between $x_q$ and $x_i$ was zero [34]. Cross validation was used to
determine the value of k.
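The weighted vote can be sketched as below. This is an illustrative sketch, not the authors' program: the function name is hypothetical, and the exponent `power` is an added assumption that defaults to 1 (weight 1/d); it is exposed because distance-weighted kNN is also commonly written with squared distances.

```python
import math


def weighted_knn(query, samples, labels, k=3, power=1):
    """Distance-weighted kNN vote; `power` is a hypothetical knob (1 gives w = 1/d)."""
    # Pair each training sample's distance to the query with its label.
    ranked = sorted(
        ((math.dist(query, s), y) for s, y in zip(samples, labels)),
        key=lambda pair: pair[0],
    )
    neighbours = ranked[:k]
    # An exact match (distance zero) decides the label directly.
    if neighbours[0][0] == 0.0:
        return neighbours[0][1]
    votes = {}
    for distance, label in neighbours:
        votes[label] = votes.get(label, 0.0) + 1.0 / distance ** power
    return max(votes, key=votes.get)


samples = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = ["a", "b", "b"]
near_a = weighted_knn((1.0, 0.0), samples, labels, k=3)
exact_b = weighted_knn((3.0, 0.0), samples, labels, k=3)
```

In the first call the single close neighbor of class "a" (weight 1) outvotes the two more distant neighbors of class "b" (weights 1/2 and 1/3), which an unweighted majority vote would not do.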

k means clustering method 

k-means clustering is an unsupervised method for finding clusters and cluster centers. The
method works in three steps: (1) select the first k samples as the seed means; (2) assign each
sample to the nearest mean and update the means; (3) end the loop when the mean values no
longer change. We used the Euclidean distance as the distance function. The program can be
downloaded from http://people.revoledu.com/kardi/tutorial/kMean/matlabkMeans.htm. Each feature
in the training data was standardized to mean 0 and variance 1 before k-means clustering was
performed [33].
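The three steps above can be sketched in a few lines. This is a minimal illustration rather than the downloaded program: the function name and `max_iter` are hypothetical, the standardization step is omitted, and the loop stops when the assignments no longer change, which implies that the means no longer change either.

```python
import math


def k_means(samples, k, max_iter=100):
    """Plain k-means seeded with the first k samples, Euclidean distance."""
    means = [list(s) for s in samples[:k]]  # step 1: seed means
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest mean.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(s, means[j])) for s in samples
        ]
        # Step 3: stop once nothing moves between clusters.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [s for s, a in zip(samples, assignment) if a == j]
            if members:  # keep the old mean if a cluster becomes empty
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
assignment, means = k_means(points, k=2)
```

In the paper's setting k would be 3, the number of pneumonia subgroups formed to balance the two-class training data.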

Mutual information 

Mutual information measures the mutual dependence between two variables based on information
theory. The mutual information of two continuous variables x and y is defined as:

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy,$$

where p(x) and p(y) are the marginal probability densities and p(x, y) is the joint probability
density. In the case of discrete variables, mutual information is defined as:

$$I(x, y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$

For each feature, we sorted the class mean values, computed the average of each pair of
adjacent means, discretized the feature at these averages, and then calculated the mutual
information. Datasets with mutual information below a threshold of 0.03 were considered noisy,
and for them the mutual information method was used instead of the Fisher statistic at the
feature deletion step.

To measure classification ability on noisy data, we discretized each feature according to the
median value of the classes for that feature, and then computed the mutual information.
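The discrete definition and the class-mean discretization described above can be sketched as follows. The function names are hypothetical, and the natural logarithm is an assumption, since the paper does not state the base of the logarithm (the 0.03 threshold therefore should not be compared with values from this sketch without checking the base).

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information of two discrete sequences (natural log, an assumption)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (count / n) * math.log((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in pxy.items()
    )


def discretize_by_class_means(values, labels):
    """Cut a continuous feature at the midpoints between sorted class means."""
    classes = sorted(set(labels))
    means = sorted(
        sum(v for v, y in zip(values, labels) if y == c) / labels.count(c)
        for c in classes
    )
    cuts = [(a + b) / 2 for a, b in zip(means, means[1:])]
    # The bin index is the number of cut points the value exceeds.
    return [sum(v > cut for cut in cuts) for v in values]


labels = [0, 0, 1, 1]
bins = discretize_by_class_means([1.0, 2.0, 9.0, 10.0], labels)
informative = mutual_information(bins, labels)
uninformative = mutual_information([0, 1, 0, 1], labels)
```

A feature whose bins match the class labels reaches the maximum for two balanced classes (log 2), and a feature independent of the labels gives 0.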

Minimum Redundancy Maximum Relevance 

The Minimum Redundancy Maximum Relevance (mRMR) method is widely used for feature selection,
for example gene selection [35]. The Maximum Relevance is defined as:

$$\max \frac{1}{|S|} \sum_{g_i \in S} I(g_i, c),$$

where I(x, y) is the mutual information of two variables x and y, S is the selected feature
set, $g_i$ is a feature of S, and c is the class label.
The Minimum Redundancy is defined as:

$$\min \frac{1}{|S|^2} \sum_{g_i, g_j \in S} I(g_i, g_j).$$

The mRMR feature set is obtained by optimizing the Maximum Relevance and Minimum Redundancy
simultaneously. Optimizing both conditions requires combining them into a single criterion
function. In this paper, the m-th feature was selected according to the value of Maximum
Relevance divided by Minimum Redundancy [20]:

$$\max_{g_j \in G - S_{m-1}} \left[ I(g_j, c) \Big/ \frac{1}{m-1} \sum_{g_i \in S_{m-1}} I(g_j, g_i) \right],$$

where G is the full feature set and $S_{m-1}$ is the set of the m-1 features already selected.

The mRMR method requires the training data to be discretized before running. Because the data
are sparse, we assigned 1 to features with expression information and 0 to features without
expression. The mRMR program can be downloaded from http://penglab.janelia.org/proj/mRMR/
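The quotient criterion can be sketched as a greedy selection over binary features. This is an illustrative sketch, not the mRMR program referenced above: the function names, the toy data and the small constant added to the denominator (to avoid division by zero when the redundancy is exactly 0) are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information of two discrete sequences (natural log)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (count / n) * math.log((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in pxy.items()
    )


def mrmr_select(features, labels, n_select):
    """Greedy mRMR: pick features by relevance divided by mean redundancy.

    `features` maps a feature name to its list of discrete values.
    """
    remaining = dict(features)
    # The first feature is simply the most relevant one.
    first = max(remaining, key=lambda g: mutual_information(remaining[g], labels))
    selected = [first]
    del remaining[first]
    while remaining and len(selected) < n_select:
        def score(g):
            relevance = mutual_information(features[g], labels)
            redundancy = sum(
                mutual_information(features[g], features[s]) for s in selected
            ) / len(selected)
            return relevance / (redundancy + 1e-12)  # guard against zero redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 0],  # relevant
    "b": [0, 0, 0, 0, 1, 1, 1, 0],  # exact duplicate of "a", fully redundant
    "c": [0, 0, 1, 0, 1, 1, 0, 1],  # less relevant, but nearly independent of "a"
}
chosen = mrmr_select(features, labels, n_select=2)
```

The duplicate feature "b" is as relevant as "a" but is passed over in the second round, because its redundancy with the already selected "a" is high.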

Kruskal-Wallis test 

The Kruskal-Wallis test is a non-parametric method for testing whether samples originate from
the same distribution [23]. The test assumes that all samples from the same group have the same
continuous distribution, and that they are mutually independent. In this study, the
Kruskal-Wallis test was used to rank features. The program can be downloaded from
http://featureselection.asu.edu/algorithms/fs_sup_kruskalwallis.zip.

Information Gain 

Information Gain measures the classification ability of each feature with respect to its
relevance to the output class. It is defined as Information Gain = H(S) - H(S|x) [27], with

$$H(S) = -\sum_{s \in S} p(s) \log_2 p(s),$$

$$H(S|x) = -\sum_{x \in X} p(x) \sum_{s \in S} p(s|x) \log_2 p(s|x),$$

where s ranges over the values of S and x over the values of the feature X. When measuring the
mutual relation between the extracted features and the class, Information Gain is also known as
mutual information [21]. We assigned 1 to features with expression information and 0 to
features without expression, and ranked the Information Gain values; the larger the value, the
more important the feature.
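A direct transcription of the two entropies gives the following sketch for one binarized feature. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H in bits of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, classes):
    """H(S) - H(S|x) for one discrete feature and the class labels."""
    n = len(feature)
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [s for x, s in zip(feature, classes) if x == value]
        conditional += (count / n) * entropy(subset)
    return entropy(classes) - conditional


classes = ["decay", "decay", "normal", "normal"]
gain_informative = information_gain([1, 1, 0, 0], classes)
gain_uninformative = information_gain([1, 0, 1, 0], classes)
```

A binary presence/absence feature that separates the two classes perfectly gains the full 1 bit of class entropy, and one that is independent of the classes gains nothing.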

χ² statistic

The chi-squared (χ²) method uses the χ² statistic to discretize numeric attributes and achieves
feature selection via discretization [22]. The χ² value is defined as

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}},$$

where c is the number of intervals, k is the number of classes, $A_{ij}$ is the number of
samples in the ith interval and the jth class, $M_i$ is the number of samples in the ith
interval, $B_j$ is the number of samples in the jth class, N is the total number of samples,
and $E_{ij} = M_i B_j / N$. We assigned 1 to features with expression information and 0 to
features without expression, and sorted the χ² statistic values; the larger the value, the more
important the feature.
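The χ² value of one discretized feature can be computed from its contingency table with the classes, as in this sketch. The function name and the toy data are hypothetical; with the 0/1 coding described above there are two intervals.

```python
from collections import Counter


def chi_square(intervals, classes):
    """Chi-squared statistic between a discretized feature and the class labels."""
    n = len(intervals)
    interval_counts = Counter(intervals)          # M_i
    class_counts = Counter(classes)               # B_j
    observed = Counter(zip(intervals, classes))   # A_ij (0 for unseen pairs)
    total = 0.0
    for i, m_i in interval_counts.items():
        for j, b_j in class_counts.items():
            expected = m_i * b_j / n              # E_ij
            total += (observed[(i, j)] - expected) ** 2 / expected
    return total


classes = ["decay", "decay", "normal", "normal"]
chi_dependent = chi_square([1, 1, 0, 0], classes)
chi_independent = chi_square([1, 0, 1, 0], classes)
```

Features are then ranked by this value, with larger values indicating a stronger association between the feature and the class.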

Additional material 